In this report I will highlight the firms our supervisors should focus on, using key metrics that I explain throughout. I will keep all my code in the report for audit and reusability purposes, as conclusions may change when new data arrives. I will focus on the three main characteristics stated: Firm Size, Changing Business Profile and Outliers. I will then apply two machine-learning techniques, principal component analysis and k-means clustering, and conclude by discussing the benefits this report would gain from being hosted on a cloud server.

Data Loading and Wrangling

I have used a combination of the openxlsx and tidyverse packages to render the xlsx file into an R-readable format. Below I have produced a summary of all the variables.

firm, year and value_type are character columns (length 9,120). Summary statistics for the numeric columns are as follows (the final column counts NA values):

variable                        Min.        1st Qu.   Median    Mean        3rd Qu.   Max.          NA's
eof_for_scr_m                   -162.25       5.98     32.09      455.68     165.65     41636.30    2620
equity_m                        -123.610      5.798    34.276     480.865    176.937    26705.042   2620
gwp_m                            -19.78       0.00     15.47      853.94     198.96     74078.64    2620
gross_bel_m                     -167.82       0.00     10.48      259.07     159.34     19292.07       0
gross_claims_incurred_m         -189.936      0.000     3.392     116.696     79.463     6844.014      0
gross_combined_ratio          -14064          0         0        1859          1       3978266          0
gross_expense_ratio           -16112.4        0.0       0.1       273.7        0.3      575064.8       0
nwp_m                         -17754.10       0.00      7.95      689.12     125.95      75526.67   2620
net_bel_m                       -182.382      0.000     4.502     164.284     93.771     11351.609     0
net_combined_ratio          -5468706          0         0         172          1       2445529          0
net_expense_ratio           -4495420          0         0        1370          0       2691212          0
pure_gross_claims_ratio       -24817.8        0.0       0.2       463.0        0.5      985011.3       0
pure_net_claims_ratio         -33721.0        0.0       0.1       388.1        0.5      828809.4       0
scr_m                              0.000      1.794    14.533     308.437     95.191     22788.359  2620
scr_coverage_ratio                -3          1         2     1301158          3      999302877     2620
total_assets_m                  -207.4       14.4     113.0      6333.0      955.4      553549.9    2620
total_liabilities_m            -1487.1        3.8      63.0      6363.4      655.7      494498.5    2620

From this summary we can see that each column is of the correct data type and that there is significant variance across the numeric variables. The extreme outliers will influence the statistical metrics used in my analysis, so I will require more variance-robust measures. The dataset now looks like this:

firm year value_type eof_for_scr_m equity_m gwp_m gross_bel_m gross_claims_incurred_m gross_combined_ratio gross_expense_ratio nwp_m net_bel_m net_combined_ratio net_expense_ratio pure_gross_claims_ratio pure_net_claims_ratio scr_m scr_coverage_ratio total_assets_m total_liabilities_m
Firm 1 2016 mean_value 484.0712 249.6647 9.353606 0.1177988 0.0011030 0.1350101 15.45025 -3550.820 1.978919 14.25977 8.202612 3.506938 3.436862 404.2644 11177400.6440186 12737.29 8034.337
Firm 1 2016 median_value 0.0000 0.0000 0.000000 0.0000000 0.0000000 0.0000000 0.00000 0.000 0.000000 0.00000 0.000000 0.000000 0.000000 0.0000 0.0000000 0.00 0.000
Firm 1 2016 sd_value 1080.2952 555.8116 20.915299 0.2634062 0.0024665 0.3018918 34.54781 7939.875 4.424999 31.88581 18.341598 7.841752 7.685056 903.9626 24993427.5609242 28476.09 17965.323
Firm 1 2016 time_series_value 2416.5598 1243.9250 46.768030 0.0000000 0.0000000 0.0000000 0.00000 -17754.100 0.000000 0.00000 0.000000 0.000000 0.000000 2021.3219 0.1625197 63676.87 40171.683
Firm 1 2017 mean_value 484.0712 249.6647 9.353606 0.1177988 0.0011030 0.1350101 15.45025 -3550.820 1.978919 14.25977 8.202612 3.506938 3.436862 404.2644 11177400.6440186 12737.29 8034.337

I have set up the data table so that only the rows with value_type time_series_value vary over time; all the other value types stay constant for a firm. This makes it straightforward to wrangle the data to find the top/bottom n firms, or the most/least varying firms, by variable, as you will see in my visualisation code.

Caveats

There are some caveats worth considering with my data-centric approach to resource allocation; these are:

Correlation Plot

Firm Size Plots

To consider firm size I will look at:

As we want to consider both the magnitude and consistency of the variables, I will use the median score to select the highest-rated firms and plot them against the rest.
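As a sketch of that selection, with a toy stand-in for the real data (the firm names and values here are illustrative only), the top 20% of firms by median value can be pulled out in base R like this:

```r
# Toy long-format data standing in for the report's df (illustrative only)
toy <- data.frame(
  firm  = rep(paste("Firm", 1:5), each = 3),
  nwp_m = c(1, 2, 3,  10, 11, 12,  100, 110, 120,  5, 6, 7,  50, 55, 60)
)

# Median per firm, then keep the top 20% of firms by that median
med   <- tapply(toy$nwp_m, toy$firm, median)
top20 <- names(sort(med, decreasing = TRUE))[seq_len(ceiling(0.2 * length(med)))]
top20  # "Firm 3"
```

The same grouping-then-slicing pattern applies to the real data via dplyr's group_by/summarise, as shown later in the report's own code.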

From this graph, in which I have only considered the top 20% of firms by median value, we can see there are five firms that are distinctly larger than the rest. It is worth highlighting, and I will explore later, the volatility of some firms' accounts: Firms 311 and 210 show large drops in 2020 (YE).

Only Firms 210 and 4 are considerably larger than the rest of the top 20% of firms (for this variable).

Firm 4 has now appeared twice, first for having a significantly larger Net Written Premium and now for Equity. This suggests it is a highly profitable insurance firm with a large equity holding, meaning it is in a very healthy financial position. I would suggest this firm is an important player in the market and one to monitor.

Firm Size Score Table

firm size_score nwp_m net_bel_m pure_gross_claims_ratio
Firm 188 582412.8404 -127.462353 12.765106 1747353.2183443
Firm 210 3024.4069 8971.023310 94.466754 7.7306263
Firm 4 2342.8611 7052.279994 13.346004 -37.0426236
Firm 26 1880.1809 5495.078203 141.915499 3.5489176
Firm 72 1476.9343 74.731447 1.834567 4354.2369564
Firm 105 1308.1439 2920.323677 996.200227 7.9077405
Firm 28 1258.6591 7.048297 5.674840 3763.2540889
Firm 284 969.9516 12.125277 11.191485 2886.5379034
Firm 25 639.4467 1652.469218 265.069667 0.8012794
Firm 311 617.2461 1846.587772 4.626903 0.5234772

I have created a table here that provides a potential list of the top 10 firms, calculated by taking the mean modified z-score of Net Written Premium, Net BEL (inc. TPs as a whole, pre-TMTP) and Pure Gross Claims Ratio. I chose the modified z-score because it is more robust to extreme outliers than the standard z-score, while still giving a good idea of where each firm's metrics sit relative to the population. I chose these three variables because, collectively, they are highly positively correlated with every other variable in the dataset, so between them they roughly capture the dataset's information. As you can see, the results still need to be read with some scepticism, as major outliers disrupt them, for example Firm 188's Pure Gross Claims Ratio value.
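For reference, a minimal sketch of the modified z-score as I have assumed it here (the median and unscaled median absolute deviation replace the mean and standard deviation):

```r
# Modified z-score: a robust alternative to the standard z-score.
# Using the median and the (unscaled) MAD means a handful of extreme
# outliers cannot dominate the score the way they dominate mean/sd.
modified_z <- function(x) {
  med     <- median(x, na.rm = TRUE)
  raw_mad <- mad(x, constant = 1, na.rm = TRUE)  # median(|x - median(x)|)
  0.6745 * (x - med) / raw_mad
}

x <- c(10, 12, 11, 13, 9, 1000)  # one extreme outlier
round(modified_z(x), 2)          # the outlier scores ~444; the rest stay small
```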

Changing Business Profiles Graphs

The volatility of a firm's accounts is of considerable concern to supervisors, so it is important to investigate.

I will start by considering variation at the metric level. I am going to use relative standard deviation, as this allows me to compare standard deviations across each of the variables effectively.
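As a sketch, relative standard deviation is simply the standard deviation scaled by the absolute mean, which makes dispersion comparable across metrics of very different magnitudes:

```r
# Relative standard deviation (RSD): sd / |mean|, a unitless dispersion
# measure that can be compared across metrics on very different scales.
rsd <- function(x) sd(x, na.rm = TRUE) / abs(mean(x, na.rm = TRUE))

rsd(c(90, 100, 110))     # 0.1
rsd(c(900, 1000, 1100))  # also 0.1, despite values being 10x larger
```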

This chart shows us that relative standard deviation (RSD) varies across metrics over time and that multiple metrics have high variance. However, six metrics stand out with a high RSD, all of them ratios: gross_expense_ratio, net_expense_ratio, gross_combined_ratio, net_combined_ratio, pure_gross_claims_ratio and pure_net_claims_ratio. Although there are six, they form three highly correlated net/gross pairs, and there are also strong relationships by definition: net_combined_ratio is \(\frac{\text{net claims and expenses incurred}}{\text{net earned premium}}\), while net_expense_ratio is just the expenses part of that equation, suggesting this could be an expenses variance. We also see a massive spike in 2016 in the RSD of gross_expense_ratio, which I suspect is misreporting, especially given there is no corresponding spike in net_expense_ratio; I will explore this later. I will look into the firms that have produced these high variations in the net variables, as these will likely explain the variations in the gross variables also.

We can see from this graph that the net_expense_ratio spike has potentially come from two enormous reported values. I will add a filter to this chart, and the other ratio charts, to keep median values between 0 and 0.87 (the upper bound being Q3 + 1.5 × IQR).

After filtering out misreporting firms, we can see the five firms whose data has changed most substantially over the period 2016-20. This is important for supervisors to be aware of because such volatility can damage investors' confidence in a firm, as well as the firm's ability to conduct any reasonable long-term financial planning.

As the Net Combined Ratio indicates the profitability of a firm, this is a very important metric, and we can see multiple firms with volatile ratios. Firms 287, 291 and 417, being in the top five most volatile (highest standard deviations) with a high net combined ratio (above 1.5), would be a large cause for concern for supervisors, as this suggests they are very unprofitable and that their future profitability is very difficult to forecast.

This is calculated as \(\frac{\text{net claims}}{\text{net earned premium}}\), so firms like 38 and 253, which have managed to reduce it from a poor position, are a positive and this would allay supervisors' concern. Firms 60, 142 and 306 are concerning for the same reasons as previously mentioned: high variance and currently very high proportionate costs.

Outliers and Misreporting

I've shown a couple of techniques I have used to filter out outliers and highlight elements in the time series. I have chosen the interquartile range method due to the presence of extreme outliers in the data; the standard deviation is influenced much more strongly by extreme values. I have created a flagging system: if a data point falls outside \((Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR)\) for the entire metric population then it is an 'IQR Outlier'; if it falls below the 2nd or above the 8th decile it is labelled a 'Decile Outlier'; otherwise it has 'No Flag'. I will provide an example below for Firm 12.
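A minimal base-R sketch of that flagging logic (I assume the decile bounds are the 20th and 80th percentiles):

```r
# Flag each observation against the whole metric population:
#  - 'IQR Outlier'    if outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
#  - 'Decile Outlier' if below the 2nd or above the 8th decile
#  - 'No Flag'        otherwise
flag_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  d   <- quantile(x, c(0.2, 0.8), na.rm = TRUE)
  flags <- rep("No Flag", length(x))
  flags[x < d[1] | x > d[2]] <- "Decile Outlier"
  # IQR flags overwrite decile flags, as they are the more severe category
  flags[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr] <- "IQR Outlier"
  flags
}

flag_outliers(c(1:10, 100))  # the value 100 is flagged as an IQR Outlier
```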

We can see there is fair homogeneity in submissions and there are no systemic reporting issues towards any particular metric.

This approach allows users to find flagged submissions quickly; if we take Firm 139's and Firm 222's scr_m and nwp_m submissions, we can see that they have a range of flags throughout the series.

firm year scr_m_flag nwp_m_flag scr_m nwp_m
Firm 139 2016 No Flag IQR Outlier 148.44436 700.60412
Firm 139 2017 No Flag IQR Outlier 102.00252 678.88901
Firm 139 2018 IQR Outlier IQR Outlier 384.08633 1013.57043
Firm 139 2019 No Flag Decile Outlier 102.85466 208.04965
Firm 139 2020 Decile Outlier IQR Outlier 137.17073 468.09469
Firm 222 2016 No Flag IQR Outlier 157.07521 223.59939
Firm 222 2017 No Flag IQR Outlier 92.17920 423.15529
Firm 222 2018 IQR Outlier Decile Outlier 286.40581 291.80749
Firm 222 2019 No Flag No Flag 66.80772 153.44594
Firm 222 2020 No Flag No Flag 78.87480 79.30072

If we graph that, we can see how this flagging approach works in practice:

This method allows us to compare each firm's submission against the rest of the population and detect whether we consider it an outlier. In this case we can safely assume the IQR Outlier will require a resubmission. The IQR outlier markers give a clear picture of where an individual firm's submissions deviate significantly from the population.

For the machine learning calculations, I am going to remove all IQR outliers to stop poor data quality impacting the outputs.

filtered_df = dplyr::anti_join(
  x = df %>%
    filter(value_type == 'time_series_value') %>%
    pivot_longer(cols = -c('firm', 'year', 'value_type')),
  y = outliers_df %>%
    pivot_longer(cols = -c('firm', 'year')) %>%
    filter(value == 'IQR Outlier'),
  # remove any that match firm , year and name
  by = c('firm', 'year', 'name')
  ) %>%
  # we are going to describe the whole time series so just need firm and metric
  dplyr::group_by(firm, name) %>%
  dplyr::summarise(
    # these are the variables I will use (for now)
    median_value = median(value, na.rm = TRUE)
    # mad_value = mad(value, na.rm = T)
  ) %>% 
  dplyr::filter(
    !is.na(median_value)
  ) %>%
  tidyr::pivot_wider(
    names_from = name, 
    values_from = median_value) %>%
  # remove any rows with NAs - this will reduce the dataset but will lead to cleaner results
  stats::na.omit()

 kable(head(filtered_df)) %>%
  kable_styling("striped", full_width = F) %>% 
  scroll_box(width = "100%")
firm eof_for_scr_m equity_m gross_bel_m gross_claims_incurred_m gross_combined_ratio gross_expense_ratio gwp_m net_bel_m net_combined_ratio net_expense_ratio nwp_m pure_gross_claims_ratio pure_net_claims_ratio scr_coverage_ratio scr_m total_assets_m total_liabilities_m
Firm 1 0.00000 0.00000 0.00000 0.0000000 0.0000000 0.0000000 0.00000 0.0000000 0.0000000 0.0000000 0.00000 0.0000000 0.0000000 0.0000000 0.000000 0.00000 0.000000
Firm 100 24.31249 132.55980 45.90523 66.2708894 0.8583713 0.2994157 29.95186 102.0876430 0.2561299 0.3134123 22.37252 0.6318334 0.1521047 0.9063138 34.142618 406.01092 40.392517
Firm 101 0.00000 0.00000 0.00000 0.0000000 0.0000000 0.0000000 0.00000 0.0000000 0.0000000 0.0000000 0.00000 0.0000000 0.0000000 1.4076342 0.000000 0.00000 0.000000
Firm 102 0.00000 0.00000 -16.43230 137.9733993 1.1148258 0.3940367 0.00000 -8.0739364 1.2914292 0.6520511 0.00000 0.4760284 0.3427111 0.0000000 0.000000 0.00000 0.000000
Firm 103 29.04979 22.21371 0.00000 0.0000000 0.0000000 0.0000000 0.00000 0.0000000 0.0000000 0.0000000 0.00000 0.0000000 0.0000000 2.5955507 1.005624 64.46984 2.291058
Firm 106 76.30481 182.72545 12.16381 0.0443066 0.0000000 0.0000000 117.99910 0.8681491 0.0000000 0.0000000 127.69550 0.0000000 0.0000000 0.9292053 55.887545 116.58011 0.000000

Machine Learning Application

One of the strengths of the PRA's supervisory infrastructure is that we are able to categorise firms based on their size, business model, area of operation and so on. Each of these characteristics requires its own supervisory approach, which the PRA can adapt to, enabling more effective regulation. For my machine learning application I will use k-means clustering to put the firms into various distinct categories. First, I will extract the important structure from the features using principal component analysis (PCA). PCA can identify the underlying structure of the data, which can make it easier for the clustering algorithm to find the true clusters. I have chosen this cleaning approach before clustering because, as a dimension-reduction process, it helps to extract the most important elements of each feature and reduces the dataset to a more comprehensible size. Given the multicollinearity problems shown at the start, this approach will also mitigate that issue and improve cluster quality. With tidymodels we can easily create a pipeline from input to PCA output:

pca_recipe <- recipes::recipe(firm ~ ., data = filtered_df)  %>%
  # step_naomit(all_numeric()) %>%
  # we only want to use complete columns
  recipes::step_filter_missing(
    all_numeric(),
    threshold = 0) %>%
  # normalize all columns
  recipes::step_normalize(all_numeric()) %>%
  # compute principal components
  recipes::step_pca(
    all_numeric(), 
    threshold = .95
    )

pca_prep = recipes::prep(x = pca_recipe, training = filtered_df)

# plot this and cluster
pca_output_df = recipes::bake(object = pca_prep, filtered_df)

 kable(head(pca_output_df)) %>%
  kable_styling("striped", full_width = F) %>% 
  scroll_box(width = "100%")
firm PC01 PC02 PC03 PC04 PC05 PC06 PC07 PC08 PC09 PC10 PC11 PC12 PC13 PC14
Firm 1 1.866716 1.9426909 -1.0615485 0.3023471 -0.0124155 0.0329275 0.0477477 -0.3485232 0.0700634 -0.1339448 0.1071496 0.0541784 -0.0788456 0.0135459
Firm 100 -2.060271 -0.5487977 -0.4889875 0.3356351 -0.5212740 0.5056186 -0.7245970 -0.5003068 0.5873495 0.4570771 -0.9099582 -0.2912519 0.3398024 -0.7756781
Firm 101 1.895210 1.9149604 -0.1456495 0.0809902 -0.1570126 -0.1127410 -0.0467271 -0.1700240 0.0580823 -0.0230969 0.0248370 -0.0254098 -0.0319223 0.0248264
Firm 102 -3.687353 0.9898153 -1.2237018 -2.5142850 0.5038193 0.6540177 -1.3928494 -1.7007434 1.2900024 0.0774827 0.0787834 0.7182910 -0.5373128 -0.0845482
Firm 103 1.969936 1.5653151 0.7077100 -0.0835912 -0.2048694 0.0241829 -0.0922690 -0.0136316 -0.0311975 0.0342323 0.0061932 0.0029417 0.0541043 0.0109580
Firm 106 2.353587 -1.6336331 -0.2567017 1.1294112 1.5942964 -0.6787836 -0.5475134 -0.4025357 0.3681808 0.4919000 -0.0283763 -0.4576122 -0.3439228 -1.1386102
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17
Standard deviation 2.411389 2.092669 1.035251 0.9175363 0.8528295 0.8201551 0.7019183 0.6781957 0.6407285 0.5959858 0.5921662 0.5266816 0.5214357 0.4804566 0.4783704 0.464964 0.4451537
Proportion of Variance 0.342050 0.257600 0.063040 0.0495200 0.0427800 0.0395700 0.0289800 0.0270600 0.0241500 0.0208900 0.0206300 0.0163200 0.0159900 0.0135800 0.0134600 0.012720 0.0116600
Cumulative Proportion 0.342050 0.599650 0.662690 0.7122200 0.7550000 0.7945700 0.8235500 0.8506100 0.8747500 0.8956500 0.9162800 0.9325900 0.9485900 0.9621700 0.9756300 0.988340 1.0000000

This provides us with 14 principal components that explain over 95% of the dataset's total variance. Needing 14 of the 17 components suggests the dimension reduction has limited explanatory power here; however, I will still investigate its merits.

I am now going to plot PC1 against PC2 to see if there are obvious clusters before performing k-means clustering.

There are few discernible groups using just PCs 1 and 2 alongside ggplot2's geom_bin2d function to highlight dense regions. I will use k-means clustering, an unsupervised ML technique that groups points together by minimising within-cluster variance; it is an incredibly efficient and explainable algorithm, and the latter point is important when sharing findings with non-technical colleagues.

What this shows us is the loadings of each of the PCs: we can see that PC1 is largely influenced by several of the metrics and PC2 by the rest. As we expect, in cases where there is a high pure_net_claims_ratio there is also a high pure_gross_claims_ratio, which matches our earlier charts. These two PC loadings agree with what we saw in the correlation matrix, where there are examples of strong positive multicollinearity but no strong negative relationships.

K-means clustering is an iterative process that assigns each data point to a group (randomly at first) and incrementally refines the clusters based on similarity to other points. First, K 'centroids' are positioned randomly and each data point is assigned to its closest centroid (traditionally by Euclidean distance). Each centroid is then repositioned at the average position of its constituents, and the process repeats. It stops when the cluster assignments no longer change.
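The process described above can be run with base R's stats::kmeans; here is a minimal sketch on synthetic two-dimensional data standing in for the PC1/PC2 scores:

```r
set.seed(42)

# Two well-separated synthetic blobs standing in for the PC1/PC2 scores
pts <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),  # 20 points around (0, 0)
  matrix(rnorm(40, mean = 5), ncol = 2)   # 20 points around (5, 5)
)

# nstart = 25 reruns the random initialisation 25 times and keeps the best
# result, mitigating k-means' sensitivity to the initial centroid positions
km <- stats::kmeans(pts, centers = 2, nstart = 25)

km$size          # points per cluster
km$tot.withinss  # total within-cluster sum of squares being minimised
```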

The chart above shows us how the groups form for different values of K; due to the shape of the population (partly a result of cleaning outliers and using median values instead of investigating temporally), the groups become difficult to disentangle when K > 5 when looking only at PCs 1 and 2.

There is no obvious optimal K value for grouping these columns. By the time the within-cluster sum of squares (WSS) reduces to a more acceptable level, there are so many groups that any inference is lost. For PCA to be successful in either inference or prediction, the data will need a more effective clean to allow the algorithms to perform better; my cleaning methods reduced the dataset, which diminishes the power of PCA. Nonetheless, we can provide this information to supervisors to help inform their choice.
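The WSS comparison across values of K (the 'elbow' heuristic) can be sketched as follows, again on synthetic data rather than the report's filtered dataset:

```r
set.seed(1)
pts <- matrix(rnorm(200), ncol = 2)  # synthetic stand-in for the PC scores

# Total within-cluster sum of squares (WSS) for K = 1..8; plotting this
# against K and looking for an 'elbow' is a common way to choose K
wss <- sapply(1:8, function(k) {
  stats::kmeans(pts, centers = k, nstart = 10)$tot.withinss
})

wss  # broadly decreasing in K; the marginal gain shrinks as K grows
```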

Application of Cloud Technologies

Once we are able to employ cloud technologies, this project would gain multiple advantages from integrating such tools. Utilising platforms such as Azure Data Factory, which is focused on hybrid data integration, would allow reduced siloing of data, more effective data ingestion, smarter data pipelines and a cost-effective approach to daily batch processing.

By having a central environment where one can see all databases, such as Azure Data Explorer, we reduce the problem of siloing. This makes parties more aware of all the data the Bank collects, even if access isn't immediately available for security reasons. It will improve collaboration across directorates as teams find overlaps in their workflows, and will reduce duplication of data collection. This is important for my report because, first, it grants a wider readership and benefits more parties; second, its visibility stops duplication by other colleagues; and last, with no secondary data source coming in, it can be considered a golden source, promoting collaboration across the Bank.

Azure Data Factory will also allow stronger data quality assurance processes: as data ingestion goes through this system, outliers or misreported data are more likely to be flagged, meaning my published report can be trusted.

For daily batch processing, the data is ingested and checked by Azure Data Factory, then loaded via custom ELT pipelines into the report hosted on a cloud platform, which is scheduled to update every morning. We can monitor this process and identify any issues we may need to resolve. Overall, hosting the report on a cloud server reduces the requirement for manual processes from myself or other colleagues, allowing us to focus on new projects and targets and thus increasing productivity. As we have outsourced ingestion and server services, we can save costs and have increased confidence that the servers and databases will be consistently available, without outages or the issues that arise from internal key-person risks.